Abstract- The majority of farmers in developing countries are unable to reach farming information and
knowledge, and so they often rely on rudimentary methods. To enhance agricultural life and
productivity, one must ensure the flow of information to farmers; that is, bridge an agricultural
information channel to farmers effectively. In this work, we build an agricultural IVR system with the
mission of providing agricultural information services to farmers. However, since the underlying
techniques rely on statistical approaches, a large amount of training material is needed to reach a
reasonable level of performance, especially in a newly applied domain. Unfortunately, the effort required
to develop such corpora is both costly and time consuming, and large-scale acquisition campaigns might
not be feasible. Under these restricted circumstances, subspace Gaussian mixture models and parametric
synthesis are the keys to building the ASR and TTS components of an IVR system. Experimental results
confirm the hypothesis, with a 3.43% gain over conventional methods in an application of agricultural
extension.

I. Introduction 

The telephone, whether landline or mobile, is often the handiest and most user-friendly access device. As a
result, people can increase access to their business services and applications by speech-enabling existing
applications [1]. With the support of spoken language processing (SLP) techniques, these
applications can be promoted into a new type of human-machine interaction: interactive voice response (IVR) [1].
Users can interact with an IVR system (technically called a "voice server") as if it were a conversational
partner.

However, telephone-based interactions pose several research challenges [2]. For example, telephone speech
is often hard to recognize and understand due to the reduced channel bandwidth and the presence of noise.
In addition, voice-based interaction relies on only the human auditory channel to receive the information,
and thus potentially increases the cognitive load. Furthermore, real-time performance is necessary, since
prolonged delay over the phone can be quite annoying to users and render the system unusable.






Voice servers have been intensively researched for a long time. In 1997, Victor Zue et al. [2] began to
develop JUPITER, a conversational interface that allows users to access and receive on-line weather forecast
information for over 500 cities worldwide over the phone. In addition, IBM [1] has successfully developed an
enterprise speech solution, named IBM WebSphere Voice Server, which provides voice-enabled
applications to give customers, employees and suppliers more flexible access to information and
services. In Vietnam, the current services provided by contact centers are mostly run by manpower
or through an [illegible] protocol. In 2010, an R&D group from AILab proposed a Vietnamese spoken dialog
system for the inquiry of stock information over the phone, with a best accuracy rate of 87.3% [3].
However, the system was built to process only stock ticker symbols, and users were not expected to
speak naturally. Since then there has been no application of voice servers in Vietnamese industry, and
research on them is still on hold.

Meanwhile, in developing countries, nearly 1.5 billion people live without electricity [13] and 752 million are
illiterate [14], two constraints that make accessing information challenging. To exacerbate this problem,
the majority of these people live in rural areas [13], which are often hard to reach because of inadequate
roads. Information about farming techniques is particularly important because agriculture is a major source
of livelihood for most rural people, though they often rely on rudimentary methods [15]. To enhance
agricultural life, one must ensure the flow of information to farmers; that is, bridge an agricultural
information channel to farmers effectively. This mission is the purpose of agricultural extension [16].

Conventional approaches to agricultural extension, such as extension workers and infomediaries, serve their
roles well. But ICT services are on the rise and have proven to be more efficient [15]. As SLP
technologies advance, IVR systems are greatly enhanced, allowing for natural speaking style, domain
adaptability, and robust speech recognition [9]. Taking advantage of this maturity, we apply our IVR framework
to provide an automated information channel for agricultural extension, that is, an automatic call center
answering questions on farming techniques. Figure 1 illustrates a typical dialog session between a farmer
and the system. For the agriculture domain, the ASR engine needs to be rebuilt in order to maintain
sustainable recognition performance. This involves collecting corpora for the applied domain and
adjusting the models. However, after all this hard work, recognition accuracy only reaches a feasible level;
errors remain quite high. The problem originates from the limited amount of training/adaptation data available,



System: "Hello, welcome to ... What do you want to ask?"
Farmer: "The ravage of borers?"
System: "Did you say 'borer'?"
Farmer: "Yes."
System: "The information regarding borer is..."
System: "Anything else?"
Farmer: "No, thanks."

Figure 1. An agricultural session of human-machine dialog.

Figure 3. Overview of Asterisk architecture.



which leads to the degradation of ASR performance. For popular languages, a large amount of speech corpus
resources is available for system development. But for languages such as Malay or Vietnamese, this is not
the case: the effort required to develop speech corpora is both costly and time consuming, and large-scale
acquisition campaigns are simply not feasible. These languages are referred to as "under-resourced
languages" [7]. Amongst state-of-the-art techniques, Subspace Gaussian Mixture Models (SGMM) have proven to
be effective against under-resourced circumstances. Thus we resort to this technique in building the IVR
system's ASR component, targeting the agricultural extension service. This paper extends the work of [9],
altering its ASR and TTS engines to comply with the under-resourced condition. Section II presents the
system architecture and its components, while Section III gives experimental results. Finally, Section IV
concludes the paper.



II. The IVR System and its Components



This section describes the Agricultural IVR system, which is responsible for answering incoming calls with
specific agricultural queries. Figure 2 illustrates the four main modules composing the system: an Asterisk
PBX server, a speech recognizer, a speech synthesizer, and a dialog manager. The Asterisk server manages
telephone signal transmission between users and the system over the PSTN, while the dialog manager
executes the tasks of querying and processing information. The speech recognizer and synthesizer operate
as a communication layer, handling speech-to-text conversion and vice versa. Incoming queries are
then interpreted and answered appropriately.

Each of the following subsections describes one of the system's components and its underlying technologies.

A. Asterisk 



Asterisk is a free software implementation of a telephone private branch exchange (PBX) that transforms a
computer into a communication server and can be used as a telephony engine and application toolkit. It is
also a framework that allows selection and removal of particular modules, allowing us to create a customized
telephony system [4]. Asterisk's well-thought-out architecture gives flexibility for creating custom modules
that extend the phone system, or even serve as drop-in replacements for the default modules. Asterisk's
flexibility allows it to be deployed as a PBX, VoIP, IVR, or voice mail system [5]. A PBX is a system
which allows one telephone to connect with other telephones and telephone services. Asterisk
can interoperate with almost all standards-based telephony equipment using comparatively inexpensive
hardware, which makes it easier to connect with traditional telephony networks as well as various computer
networks. In this way, new functionality can be added to an Asterisk PBX server: by writing dial plan
scripts in Asterisk's own extension language, by including custom loadable modules written in C, or by
implementing Asterisk Gateway Interface (AGI) programs in any programming language, such as Perl, Python,
or shell scripts. Figure 3 depicts the Asterisk architecture.
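As a rough illustration of the AGI route mentioned above, the following Python sketch shows the text protocol an AGI program speaks: Asterisk first sends a block of "agi_name: value" lines ended by a blank line, and every command the program writes is answered by a line such as "200 result=0". The function names, the prompt file name "welcome", and the simulated exchange are assumptions made for illustration; this is not the code used in the system described here.

```python
import io
import re


def read_agi_env(stream):
    """Read the 'agi_name: value' header block Asterisk sends at start-up."""
    env = {}
    while True:
        line = stream.readline().strip()
        if not line:  # a blank line (or end of input) ends the header block
            break
        key, _, value = line.partition(":")
        env[key.strip()] = value.strip()
    return env


def parse_agi_response(line):
    """Split a reply such as '200 result=0 endpos=8000' into (code, result)."""
    match = re.match(r"(\d{3}) result=(-?\d+)", line.strip())
    if match is None:
        raise ValueError("unexpected AGI reply: %r" % line)
    return int(match.group(1)), int(match.group(2))


def agi_command(command, out, inp):
    """Send one AGI command and return the parsed reply."""
    out.write(command + "\n")
    out.flush()
    return parse_agi_response(inp.readline())


def greet_caller(inp, out, prompt="welcome"):
    """Answer the call, play a prompt file (hypothetical name), and hang up."""
    env = read_agi_env(inp)
    agi_command("ANSWER", out, inp)
    agi_command('STREAM FILE %s ""' % prompt, out, inp)
    agi_command("HANGUP", out, inp)
    return env


# Offline demonstration: a string buffer stands in for Asterisk.
fake_asterisk = io.StringIO(
    "agi_request: greet.py\n"
    "agi_callerid: 0123456789\n"
    "\n"
    "200 result=0\n"
    "200 result=0 endpos=8000\n"
    "200 result=1\n"
)
sent = io.StringIO()
env = greet_caller(fake_asterisk, sent)
```

When such a script is launched by Asterisk itself, `sys.stdin` and `sys.stdout` would take the place of the two string buffers.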

The heart of any Asterisk system is the PBX core. It is the essential component that takes care of bridging
calls. The core also takes care of other items like the codec translator, scheduler and I/O manager,
application, and other modules.

B. Speech Recognizer


To cope with the problem of limited training data, Subspace Gaussian Mixture Model (SGMM) acoustic
modeling techniques [11] are used. In contrast to approaches that deploy a set of universal phones
to cover multiple languages, the SGMM approach uses distinct phone sets but shares a large number of
parameters across languages. In SGMM, HMM-state feature distributions are Gaussian Mixture Models
(GMMs) with a common structure, constrained to lie in a subspace of the total parameter space. The
parameters that define this subspace can be shared across languages/domains. Formally, the feature
distribution of an HMM state j is given by




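$$p(\mathbf{x} \mid j) = \sum_{i=1}^{I} w_{ji}\, \mathcal{N}(\mathbf{x};\, \boldsymbol{\mu}_{ji}, \boldsymbol{\Sigma}_i) \qquad (1)$$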



where $\mathbf{x}$ is the feature vector and $\mathcal{N}(\mathbf{x};\, \boldsymbol{\mu}, \boldsymbol{\Sigma})$ is the Gaussian function. This might look similar to the
conventional GMM; however, the difference lies in the way the mixtures are represented. An intuitive
illustration of both models can be seen in Figure 4. For SGMM, a particular state j is associated with a
vector $\mathbf{v}_j$ which determines the means and weights as follows:



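$$\boldsymbol{\mu}_{ji} = \mathbf{M}_i \mathbf{v}_j \qquad (2)$$

$$w_{ji} = \frac{\exp(\mathbf{w}_i^{\top} \mathbf{v}_j)}{\sum_{i'=1}^{I} \exp(\mathbf{w}_{i'}^{\top} \mathbf{v}_j)} \qquad (3)$$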



where $\mathbf{M}_i$ and $\mathbf{w}_i$ are shared across all state distributions. In addition, the covariance matrices $\boldsymbol{\Sigma}_i$ are globally
shared as well. Together, $\mathbf{M}_i$, $\mathbf{w}_i$ and $\boldsymbol{\Sigma}_i$ form the set of globally shared parameters, as opposed to the
state-specific vectors $\mathbf{v}_j$.
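To make the mapping from a state vector to its mixture parameters concrete, the sketch below derives the means (each shared projection matrix applied to the state vector) and the weights (a softmax over the inner products of the shared weight vectors with the state vector) for one state. The toy dimensions and numbers are invented for illustration and are far smaller than any real configuration; the helper names are likewise assumptions.

```python
import math


def mat_vec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def state_means(M, v_j):
    """Mean of each shared component i for state j: mu_ji = M_i v_j."""
    return [mat_vec(M_i, v_j) for M_i in M]


def state_weights(w, v_j):
    """Mixture weights for state j: softmax over i of w_i^T v_j."""
    scores = [sum(a * b for a, b in zip(w_i, v_j)) for w_i in w]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Toy sizes (assumed): I = 2 shared components, feature dimension 2, subspace dimension 2.
M = [
    [[1.0, 0.0], [0.0, 1.0]],  # M_1
    [[2.0, 0.0], [0.0, 2.0]],  # M_2
]
w = [[0.0, 0.0], [0.0, 0.0]]   # w_1, w_2: all zeros gives uniform weights
v_j = [0.5, -1.0]              # state-specific vector

means = state_means(M, v_j)
weights = state_weights(w, v_j)
```

Only `v_j` belongs to the state; `M` and `w` would be reused unchanged by every other state, which is what keeps the per-state parameter count small.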
Figure 4. HMM structures.

Figure 5. SGMM with sub-states.



To achieve a balance between the amount of shared and state-specific parameters, the notion of a "sub-
state" [11] was introduced. Instead of just one state vector, the feature distribution for a state can be
represented by a mixture of M vectors, each with its own weight c. Figure 5 gives a clearer picture of this
notion. In this case, the feature distribution of a state j is given by:



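$$p(\mathbf{x} \mid j) = \sum_{m=1}^{M} c_{jm} \sum_{i=1}^{I} w_{jmi}\, \mathcal{N}(\mathbf{x};\, \boldsymbol{\mu}_{jmi}, \boldsymbol{\Sigma}_i) \qquad (4)$$

$$\boldsymbol{\mu}_{jmi} = \mathbf{M}_i \mathbf{v}_{jm} \qquad (5)$$

$$w_{jmi} = \frac{\exp(\mathbf{w}_i^{\top} \mathbf{v}_{jm})}{\sum_{i'=1}^{I} \exp(\mathbf{w}_{i'}^{\top} \mathbf{v}_{jm})} \qquad (6)$$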



Utilizing SGMM, one can deal with the problem of limited training data under under-resourced conditions.
Indeed, the set of globally shared parameters {M_i, w_i, Σ_i} can be trained on out-of-domain data, while the
state-specific vectors {v_j} can be trained on a limited amount of in-domain data. In the experiments for this
paper, broadcast news and agricultural telephony are selected as the well-resourced and under-
resourced domains respectively (i.e., the broadcast news corpus serves as the out-of-domain data and the
agricultural telephony corpus serves as the in-domain data).
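The split between shared and state-specific parameters can be seen in a small likelihood computation for a state with sub-states. The sketch below is illustrative only: it assumes diagonal covariances to keep the Gaussian density short (the shared covariances need not be diagonal in practice), and all names and toy values are assumptions rather than part of the system described here.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def diag_gaussian(x, mean, var):
    """Density of a Gaussian with diagonal covariance (a simplifying assumption)."""
    log_p = 0.0
    for x_d, m_d, v_d in zip(x, mean, var):
        log_p += -0.5 * (math.log(2.0 * math.pi * v_d) + (x_d - m_d) ** 2 / v_d)
    return math.exp(log_p)


def state_likelihood(x, substates, M, w, variances):
    """p(x | j) = sum_m c_jm sum_i w_jmi N(x; M_i v_jm, Sigma_i).

    substates        : list of (c_jm, v_jm) pairs, the state-specific part
    M, w, variances  : globally shared parameters, one entry per component i
    """
    total = 0.0
    for c_jm, v_jm in substates:
        weights = softmax([dot(w_i, v_jm) for w_i in w])
        for w_jmi, M_i, var_i in zip(weights, M, variances):
            mean = [dot(row, v_jm) for row in M_i]
            total += c_jm * w_jmi * diag_gaussian(x, mean, var_i)
    return total


# Shared parameters (would be estimated on out-of-domain plus in-domain data).
M = [[[1.0]]]          # one component, 1-D features, 1-D subspace
w = [[0.0]]
variances = [[1.0]]

# State-specific parameters (would be estimated on in-domain data only).
substates = [(0.5, [0.0]), (0.5, [0.0])]

p = state_likelihood([0.0], substates, M, w, variances)
```

With both sub-state vectors at zero, the state collapses to a single standard normal, so `p` equals the standard normal density at 0.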

C. Speech Synthesizer 

The original work [9] employs VOS's corpus-based version [8] to power its TTS engine. This has the
advantage of naturalness and intelligibility, but suffers from an oversized database (i.e., more than
4 GB / 40 h in duration) and therefore lacks portability and dialect variation. Under under-resourced
conditions, even several hours of speech are unaffordable, let alone 40 h / 4 GB. Building a corpus-based
TTS engine would therefore be infeasible.




Figure 6. HMM-based speech synthesis (training phase and synthesis phase).

Figure 7. Keyword spotting FSM.



Complying with this condition, we adapt the TTS core to parametric synthesis, namely HMM-based synthesis
[12], which has the advantages of lightweight storage and smooth prosody. Figure 6 gives an outline of
the TTS flow. The required data can be as little as 45 minutes of speech, and a highly natural voice can be
achieved at around the 2-hour level. Sample speech for training the models was collected from three candidate
speakers of the Saigon, Hue, and Hanoi dialects, providing three different voices as user options, which
makes this the first Vietnamese TTS system capable of doing so.




D. Dialog Manager 

Acting as the brain of an IVR system, the dialog manager controls the calling sessions. A preset communication
script is enforced on each session. It starts with a welcome message and waits for the user's response. Users
ask an agricultural question in a very natural manner. The dialog manager is bound to find and speak back
an appropriate answer. From this step, users can also choose to ask another question, until the end of the
session is confirmed.
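The session script described above can be sketched as a small loop. This is only an illustration of the control flow: `recognize`, `answer_for` and `speak` are hypothetical hooks standing in for the ASR engine, the answer lookup and the TTS engine, the prompt wording is invented, and the confirmation turn shown in Figure 1 is left out for brevity.

```python
import re


def is_goodbye(utterance):
    """Treat a reply whose first word is 'no' as declining to continue."""
    return re.match(r"\s*no\b", utterance, re.IGNORECASE) is not None


def run_session(recognize, answer_for, speak, max_turns=20):
    """Welcome the caller, answer questions, and stop when the caller declines.

    recognize()       -> the caller's next utterance as text (hypothetical ASR hook)
    answer_for(text)  -> an answer string, or None if nothing was understood
    speak(text)       -> plays a prompt to the caller (hypothetical TTS hook)
    """
    speak("Hello, welcome. What do you want to ask?")
    utterance = recognize()
    for _ in range(max_turns):  # guard against endless sessions
        answer = answer_for(utterance)
        if answer is None:
            speak("Sorry, please ask again.")
        else:
            speak(answer)
            speak("Anything else?")
        utterance = recognize()
        if is_goodbye(utterance):
            break


# Scripted demonstration with canned caller utterances.
caller = iter(["The ravage of borers?", "No, thanks."])
answers = {"borer": "The information regarding borer is ..."}
spoken = []


def lookup(text):
    for key, answer in answers.items():
        if key in text.lower():
            return answer
    return None


run_session(lambda: next(caller), lookup, spoken.append)
```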

In contrast to voice commands by keywords, our system provides a flexible means of voice communication
through natural language interaction. One can ask a question as one would of a person, by saying something
such as "What is borer?" or "How can I fight locust?" or simply "The ravage of borers". The dialog manager
will fetch back the appropriate answer. To achieve this goal, a keyword spotting mechanism is proposed to
pick out important terms from a complete sentence. Let A be the set of agricultural keywords and B stand
for the set of grammar terms. The finite state machine (FSM) depicted in Figure 7 is used to render the
keyword spotting functionality. In this sequence, agricultural terms (A) are always required for queries,
while grammar terms (B) are optional and can be disposed of. If the final state is not reached, a null query
will be assumed.
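A minimal sketch of this keyword spotting idea is given below. The exact topology of the FSM in Figure 7 is not reproduced here; the sketch assumes a two-state machine that moves to the final state on the first agricultural term and ignores everything else. The contents of the two term sets are invented examples built from the sample questions above.

```python
import re

# Set A: agricultural keywords (illustrative entries only).
AGRI_TERMS = {"borer", "borers", "locust"}
# Set B: grammar terms that may surround a keyword (illustrative entries only).
GRAMMAR_TERMS = {"what", "is", "how", "can", "i", "fight", "the", "ravage", "of"}


def spot_keywords(sentence):
    """Return the agricultural terms found, or None for a null query."""
    state = "start"
    keywords = []
    for token in re.findall(r"[a-z]+", sentence.lower()):
        if token in AGRI_TERMS:
            keywords.append(token)
            state = "final"       # at least one term from A has been seen
        elif token in GRAMMAR_TERMS:
            continue              # terms from B are optional and discarded
        # any other token is also skipped without changing the state
    return keywords if state == "final" else None


result_locust = spot_keywords("How can I fight locust?")
result_borer = spot_keywords("The ravage of borers")
result_null = spot_keywords("Hello, how is the weather?")
```

The first two calls return `["locust"]` and `["borers"]`; the third never reaches the final state and is treated as a null query.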



III. Experiments

This section focuses on the evaluations of the ASR engine, online trial, and runtime response. All of
them are conducted on the dataset described below.



A. Speech Corpora

TABLE I.

Corpus                         Duration     Speakers
VOH                            27 hours     18
Agricultural Telephony (AT)    7.2 hours    62



We first collect the agricultural telephony speech corpus from 62 speakers of the Mekong Delta, who
represent the farmers. The total duration is roughly 7.2 hours, with a vocabulary size of 103 words
(including keywords, grammar terms and confirmation words), several of which are listed in Table II. Next,
we compile it together with the VOH corpus to evaluate the recognizer. Both corpora are converted to an
identical format ([...] kHz, 16 bits, mono). They are further parameterized into 12-dimensional MFCCs and
energy, plus their delta and acceleration coefficients (39-dimensional front-end parameters).
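The step from 13 static coefficients (12 MFCCs plus energy) to a 39-dimensional vector can be sketched as follows. The text does not state how the delta and acceleration coefficients are computed; the sketch assumes the common regression formula over a window of two frames on each side, with edge frames replicated, and uses an artificial ramp signal instead of real MFCCs.

```python
def deltas(frames, window=2):
    """Regression-based time derivative of a sequence of feature vectors.

    d_t = sum_n n * (c_{t+n} - c_{t-n}) / (2 * sum_n n^2), n = 1..window
    (an assumed, commonly used formula; edge frames are replicated).
    """
    norm = 2.0 * sum(n * n for n in range(1, window + 1))
    last = len(frames) - 1
    out = []
    for t in range(len(frames)):
        row = []
        for d in range(len(frames[t])):
            acc = 0.0
            for n in range(1, window + 1):
                ahead = frames[min(t + n, last)][d]
                behind = frames[max(t - n, 0)][d]
                acc += n * (ahead - behind)
            row.append(acc / norm)
        out.append(row)
    return out


def add_dynamic_features(static):
    """Append delta and acceleration (delta of delta) to each static frame."""
    d1 = deltas(static)
    d2 = deltas(d1)
    return [s + a + b for s, a, b in zip(static, d1, d2)]


# Five artificial frames of 13 static coefficients, rising by 1 per frame.
static = [[float(t)] * 13 for t in range(5)]
features = add_dynamic_features(static)
```

Each frame in `features` has 13 + 13 + 13 = 39 entries, and the delta of the middle frame is 1.0, matching the slope of the ramp.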



TABLE II.

Lexicon Samples

bac      mau      chau     chau     cuon     la       chay     vang     ling     giong
nhiem    khuan    lam      dong     ngap     trang    sau      dom      vong     than
cho      u        ok       toi      hay      sai      dung     roi      vang     khong
The corpora (shown in Table I) are then divided into subsets for training and testing the two target ASR
engines (i.e., the baseline and SGMM systems). Table III summarizes the training and test sets devised for
the experiments.



TABLE III.

Training and Test sets

             Training                            Test
System       Hours    Corpora                    Hours    Corpora
Baseline     6.2      6.2 h Agriculture (AT)     1        1 h AT
SGMM         33.2     27 h VOH + 6.2 h AT        1        1 h AT



Language models (trigrams) for the recognizers are built by interpolating individual models built from
the Web text corpus and the training data's transcriptions.
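The interpolation step can be illustrated with a toy example. The text does not specify the interpolation scheme or its weight; the sketch assumes plain linear interpolation with a weight `lam` on the Web-text model, and the probability values are invented. Each dictionary holds the next-word distribution for one fixed two-word history, as a trigram model would provide.

```python
def interpolate(p_web, p_train, lam):
    """Linear interpolation: lam * P_web(w | h) + (1 - lam) * P_train(w | h)."""
    words = set(p_web) | set(p_train)
    return {
        word: lam * p_web.get(word, 0.0) + (1.0 - lam) * p_train.get(word, 0.0)
        for word in words
    }


# Invented next-word distributions for one history (both sum to 1).
p_web = {"la": 0.5, "roi": 0.5}      # from the Web text model
p_train = {"la": 1.0}                # from the training transcriptions

mixed = interpolate(p_web, p_train, lam=0.5)
```

Because both inputs are proper distributions over the same history, the mixture is too: here `mixed` assigns 0.75 to "la" and 0.25 to "roi".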



B. Transcription Evaluation

TABLE IV.

Transcription performances

          Baseline    SGMM
%WAR      90.26%      93.04%



In this experiment, the recognizers are evaluated on the task of speech transcription. Performances are
reported for two different systems: baseline and SGMM. The baseline recognizers are based on conventional
3-state left-to-right HMM triphone models, with 18 Gaussians per state. The SGMM system's shared
parameters are estimated using data from both VOH and AT, while the state-specific parameters are
trained on AT data only. An SGMM configuration with 400 shared Gaussian components (I = 400), 40-
dimensional state-vectors and 12 sub-states per state is used.
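A back-of-the-envelope parameter count shows why this configuration suits a small in-domain corpus. The sketch below plugs in the figures quoted above (I = 400, 40-dimensional state vectors, 12 sub-states, 18 Gaussians, 39-dimensional features) but rests on assumptions the text does not state: full covariance matrices for the shared SGMM components and diagonal covariances for the baseline Gaussians.

```python
def sgmm_shared_params(I, D, S):
    """Globally shared parameters: projections M_i, weight vectors w_i, covariances."""
    projections = I * D * S                # each M_i is D x S
    weight_vectors = I * S                 # each w_i has S entries
    covariances = I * D * (D + 1) // 2     # full symmetric D x D (assumed)
    return projections + weight_vectors + covariances


def sgmm_state_params(S, substates):
    """State-specific parameters: one S-dim vector plus one weight per sub-state."""
    return substates * (S + 1)


def gmm_state_params(D, gaussians):
    """Conventional state: mean, diagonal variance (assumed) and weight per Gaussian."""
    return gaussians * (2 * D + 1)


shared = sgmm_shared_params(I=400, D=39, S=40)
per_state_sgmm = sgmm_state_params(S=40, substates=12)
per_state_gmm = gmm_state_params(D=39, gaussians=18)
```

Under these assumptions each SGMM state carries 492 parameters against 1422 for a conventional state, while the 952000 shared parameters are the part that out-of-domain data can help estimate.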

Table IV summarizes the performances of the recognizers. Using SGMM, an absolute improvement of
2.78% WAR over the baseline is achieved. The results confirm the benefit of SGMM in taking advantage of
resources in other domains whenever only a small amount of training data is available.
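For readers unfamiliar with the metric, the sketch below shows one common way to compute a word accuracy rate. The text does not define WAR; the sketch assumes it is 100% minus the word error rate, with errors counted as the word-level edit distance between the reference and the recognized sentence. The sentence pair is invented.

```python
def edit_distance(ref, hyp):
    """Minimum number of substitutions, deletions and insertions (word level)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def word_accuracy(reference, hypothesis):
    """WAR in percent, assumed here to be 100 * (N - errors) / N."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    errors = edit_distance(ref_words, hyp_words)
    return 100.0 * (len(ref_words) - errors) / len(ref_words)


# One deleted word out of five reference words.
war = word_accuracy("how can i fight locust", "how can fight locust")

# Absolute gain reported in Table IV.
gain = round(93.04 - 90.26, 2)
```

Here `war` is 80.0, and `gain` reproduces the 2.78% absolute improvement quoted above.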

C. IVR Online Trial

For online trials, the whole IVR system was deployed in a real data center connected to a telephone
network. Users dial the system number from their mobile phones and interact with our voice server. We
ask 30 volunteers each to make [...] calling attempts separately. This means users do not need to get used
to the system, and they are free to say whatever they want in terms of agricultural queries. Each calling
session is processed by both ASR engines (i.e., baseline and SGMM) simultaneously. Results, in accuracy
rates, can be seen in Figure 8.

As expected, SGMM gains the upper hand over the baseline, but falls short of its own transcription
performance. Without utterance constraints in the online trials, the score decreased by approximately 2.62%
compared to the transcription tests. However, the average WAR of 90.42% in the online trial indicates
that the proposed system could be an effective call center for agricultural extension services.

D. Runtime response 



As an agricultural extension service, the system's response time is crucial. To be deployable, its
responses must be delivered in real time or better. This experiment measures the running time of each
communication session, including both ASR and TTS computations. The same 30 volunteers who
participated in the online trials are asked to communicate with the server using random utterances.
Processing durations are logged, yielding an average response time of 2.349 seconds. Figure 9
plots the timing performance of the first 50 loops.

Figure 8. IVR performances.

Figure 9. Runtime performance.



IV. Conclusion

This paper has presented a critical enhancement for voice servers under under-resourced conditions. Should we
run low on corpora, SGMM-based ASR and HMM-based TTS techniques are always there for us.
Experimental results confirm the hypothesis.